Depletion probes

ABSTRACT

The invention provides sets of RNA depletion probes, short DNA oligos that hybridize along the length of a target RNA and mediate digestion of the target RNA by RNase H to remove super-abundant RNA molecules from a sample. Depletion probes according to the invention are designed foremost based on biochemistry and the biophysical properties of the probes so that all of the depletion probes of a set exhibit substantially uniform, consistent behavior in binding to a target RNA in a sample. Probes are principally designed to specific performance targets and biophysical properties, yielding probe sets with irregular, even apparently random, spacing along a target RNA molecule.

TECHNICAL FIELD

The disclosure relates to depletion probes.

BACKGROUND

The transcriptome refers to the RNA transcripts of a cell or organism at a given time. Knowledge of the transcriptome reveals active cellular processes and can provide information about cell regulation, growth, and dysfunction.

Transcriptome analysis typically involves microarray technology and, more commonly, next-generation sequencing technologies (e.g., RNA-Seq). Common objectives of RNA-Seq are to detect all of the diverse transcripts present, including mRNA and non-coding RNA, as well as to detect splice variants, mutations, mobile genetic elements, and expression levels during various stages of development or under various conditions. Transcriptomics and RNA-Seq have application in diagnostics, disease profiling, pathogen detection, evolutionary biology, and other areas of research. For example, RNA-Seq can potentially identify genes involved in resistance to environmental stresses, such as drought resistance in crops. In another example, transcriptomic profiling can provide information on mechanisms of drug resistance, potentially revealing strategies for combating hospital-acquired antibiotic-resistant infections.

SUMMARY

The invention provides sets of RNA depletion probes, short DNA oligos that hybridize along the length of an RNA to be depleted from a sample. The DNA oligos mediate digestion of the target RNA by RNase H. The invention is useful to remove non-target RNA from a sample in order to promote capture of low abundance RNA or simply to achieve a higher yield of target RNA. Depletion probes according to the invention are designed based on the biochemistry and biophysical properties of probe/target interaction so that all of the depletion probes of a set exhibit uniform, consistent behavior. For example, the depletion probes may be designed to have a melting temperature (Tm) within a narrow, pre-defined range. Probes according to the invention are designed to achieve improved performance, often yielding probe sets with irregular, even apparently random, spacing along an RNA that is to be removed from a sample.

Regardless of the irregular spacing, probe sets of the invention outperform prior art probe sets due to the rational biology-based design that models Tm and screens reference RNA sequence data to omit candidate probe sequences with off-target matches. Prior art probe sets sacrifice complete depletion in order to achieve uniform probe binding along an RNA to be removed. Due to the limiting assumption that probe sets must tile along the length of the RNA with uniform spacing and melting temperature, a number of the probes bind poorly due to factors such as GC content, secondary structure, or length. The invention addresses those problems by using a design methodology that works within a selected window of melting temperatures and reduces off-target binding.

Probes of the invention form heteroduplexes along one or more RNA molecules with unpredictable, apparently random spacing. In fact, probe sets of the invention may leave long gaps of RNA (in some cases, gaps of more than about 100 bases) yet still outperform prior art probe sets designed to follow some simple pattern of short, regular gaps along the target RNA. In addition, probes of the invention show improved depletion even in highly-degraded samples. Another advantage of probe design according to the invention is that it allows for probe hybridization and RNA digestion to occur in a single, rapid step as opposed to the use of heating/cooling cycles followed by a lengthy digestion at a suboptimal temperature. Because probe sets of the invention use probes designed to exploit the actual sequence content and biophysical properties of heteroduplex formation, RNA depletion using probe sets of the invention reliably removes superabundant RNAs from a sample. Because such tools may be used to reliably remove RNAs, such as ribosomal RNAs and/or globin transcripts, RNA-Seq assays using probe sets of the invention better detect the remaining, and sometimes rare, mRNA transcripts. Because rare genetic events are markers of critical conditions, such as cancer or pathogenic infection, probe sets of the invention are useful for the early and accurate detection of critical biological information.

In certain aspects, the invention provides methods for depleting RNA. Preferred methods include hybridizing a plurality of DNA oligos to a target RNA molecule in a sample to form heteroduplexes, wherein the DNA oligos have sequences selected to (i) give the heteroduplexes melting temperatures within a predetermined range, and (ii) minimize matches to reference off-target RNAs. Methods of the invention further include digesting RNA in the heteroduplexes. The heteroduplexes may be formed with a non-uniform distribution along the RNA molecule. The DNA oligos may include low GC oligos and high GC oligos that have higher melting temperatures than the low GC oligos. The DNA oligos may include low GC oligos and high GC oligos that are shorter than the low GC oligos. The DNA oligos may be selected to minimize spacing between adjacent heteroduplexes and may be present in non-uniform concentrations.

In some embodiments, DNA oligos are selected by: generating a set of candidate sequences complementary to a target RNA (in this case, the RNA to be removed) with predicted melting temperatures within a predetermined range; assigning a cost function to each of the candidate sequences based at least in part on a match score to the reference off-target RNAs; and selecting a set of the candidate sequences that map to positions along the RNA molecule, wherein the set is selected by an algorithm that minimizes cumulative cost function, inter-position gaps, and/or overlap. The DNA sequences may be selected by a computer system to provide DNA oligos that each uniquely anneal to the RNA molecule with similar annealing temperatures, with minimal or no overlap, and with minimal or no annealing to off-target RNA. Optionally the computer system further selects the sequences to provide the DNA oligos with a minimum length and/or to minimize gaps between heteroduplexes.

In some embodiments, DNA oligos are tested and shown not to hybridize stably to protein coding transcripts in a standardized human reference RNA extract. The target RNA molecule may include any of a ribosomal RNA, a globin transcript, or a mitochondrial RNA. Preferred methods include performing the recited steps to digest ribosomal RNA and globin transcripts from the sample. The digesting step may include treating the sample with RNase H. In addition, preferred methods may include, after the digesting step, reverse transcribing non-target RNA molecules from the sample. In preferred embodiments, the heteroduplexes are formed at positions along the RNA molecule that do not form any regular, repeating, or uniform pattern or spacing. For example, spacing between the heteroduplexes may be highly variable and/or mathematically random. Positioning of the heteroduplexes may leave one or more uncovered regions of the RNA molecule of at least about 10 to about 100 bases. Moreover, coverage need not be end to end, meaning that 5′ and 3′ end sequences may be uncovered.

Methods of the invention may increase a ratio of poly-A tailed RNA to non-coding RNA in the sample. In some embodiments, sequences of at least two of the DNA oligos are selected to hybridize to the RNA to be depleted at locations that overlap or abut one another. DNA oligos used in the invention may have lengths that vary from about 20 bases to about 40 bases (e.g., vary between 18 and 44 inclusive). The DNA oligos may be present in the mix at non-equimolar concentrations. For example, based on empirical observations, some of the probes may be boosted, e.g., doubled in concentration.

Aspects of the invention provide a set of RNA depletion probes. An exemplary set includes a plurality of DNA oligos that hybridize to an RNA to be depleted to form heteroduplexes, wherein the DNA oligos have sequences selected to (i) give the heteroduplexes melting temperatures within a predetermined range, and (ii) minimize matching to off-target RNA. Preferably the heteroduplexes are formed with a non-uniform distribution along the RNA molecule. Spacing between the heteroduplexes in the set may be highly variable and/or mathematically random. The DNA oligos may be selected to minimize spacing between adjacent heteroduplexes. The DNA oligos may include both low GC oligos and high GC oligos (that have higher melting temperatures than the low GC oligos and/or are shorter than the low GC oligos). The probe set may be designed by generating a set of candidate sequences complementary to the RNA molecule with predicted melting temperatures within the predetermined range; assigning a cost function to each of the candidate sequences based at least in part on a match score to the reference off-target RNAs; and selecting a set of the candidate sequences that map to positions along the RNA molecule, wherein the set is selected by an algorithm that minimizes cumulative cost function, inter-position gaps, and/or overlap. The probe set may be designed to deplete any of ribosomal RNA, mitochondrial RNA, globin transcripts, or others. Notably, the heteroduplexes may form at positions along the RNA molecule that do not form any regular, repeating, or uniform pattern or spacing.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 diagrams a method of probe design according to certain embodiments.

FIG. 2 shows steps of a method of probe selection.

FIG. 3 shows a distribution of inter-probe gap sizes.

FIG. 4 is a histogram of gap sizes.

FIG. 5 is a histogram of probe lengths.

FIG. 6 diagrams a Bloom filter design methodology.

FIG. 7 describes an example of a method of probe design.

FIG. 8 shows ribosomal RNA depletion from whole blood.

FIG. 9 shows the result of hemoglobin RNA depletion from whole blood.

FIG. 10 shows rRNA depletion from formalin-fixed, paraffin-embedded (FFPE) tissue.

FIG. 11 shows rRNA depletion from universal mouse and rat references.

DETAILED DESCRIPTION

The disclosure relates to methods for designing and using RNA depletion probes as well as probe sets designed according to such methods. Methods and materials of the disclosure are useful to remove species of RNA molecules from a sample. Methods of the invention have particular applications in transcriptomics and certain RNA-based workflows intended to identify or quantify certain populations of RNA molecules. For example, it is often the case that there is a need to isolate specific RNA molecules, e.g., mRNA, in a sample. Depletion probes of the invention are useful to deplete, for example, ribosomal catalytic RNAs (which do not themselves code for proteins) that might interfere with detection of the target mRNA in the sample. As an example, one approach to isolating and evaluating mRNA while avoiding rRNA molecules is poly-A-specific priming with oligo-dT. However, the rRNA is still present and interfering, and there are a variety of workflows that benefit from using target capture other than poly-T primer hybridization to mRNA poly-A tails. Moreover, partially-degraded samples, such as formalin-fixed paraffin-embedded (FFPE) samples, frequently have separation of the 3′ poly-A tail from the 5′ end of the transcript, which makes information from the 5′ ends inaccessible with techniques relying on poly-A enrichment. In addition, certain RNA species, such as long non-coding RNA (lncRNA), do not contain poly-A tails. In such examples, the inability to deplete non-target RNA makes detection of the target more difficult.

Methods and compositions of the invention provide tools for depleting any RNA species out of a sample. RNA depletion according to embodiments of the invention operates by the use of “depletion probes” or DNA oligos designed to hybridize to the RNA molecules to be depleted. The DNA oligos anneal to the RNA to form heteroduplexes. RNase H recognizes DNA-RNA heteroduplexes and cleaves the RNA in a heteroduplex. RNA depletion using RNase H is appealing because it does not require that the RNA to be preserved contains poly-A tails and is indifferent to the presence of poly-A tails on the RNA molecule targeted for depletion. Thus, even an arbitrary mRNA transcript, such as a globin transcript, may be depleted by RNase H depletion. In addition, methods of the invention permit lncRNA and mRNA enrichment even in highly-degraded samples.

According to the invention, RNase H depletion utilizes depletion probes, short DNA oligos that hybridize to the RNA to be depleted, forming heteroduplexes. Some conventional approaches use a set of depletion probes that tile along an RNA molecule. The use of tiled probes is appealing due to the possibility that the entire target molecule is covered. Other approaches have been suggested that allow certain spacing between probes, such as leaving an inter-probe gap of some specific number of bases between each adjacent pair of probe targets. Moreover, some depletion probe sets may include probes that are described as overlapping one another, apparently offering super-coverage (e.g., coverage >1). In such description, “overlap” is a recognized term but the word choice is misleading to the extent it describes two probes hybridizing to a single target molecule with one of them leaving an un-bound single-stranded end where the other probe is fully bound. In actual use it is just as possible that each of the so-called overlapping probes will find independent copies of the same type of target RNA molecule and each fully bind to one molecule. As used herein, the word overlap simply acknowledges that the DNA oligos are designed against complementary regions where, for two oligos, it is their cognate targets that overlap.

Regardless of the probe tiling strategy, whether abutting, uniformly spaced, or overlapping, existing probe designs reveal a design paradigm in which one decides which coverage pattern appears best suited and applies that coverage pattern to cover the molecule to be depleted. The coverage pattern may mandate that the probe targets overlap, abut, or have regular spacing. Once the pattern is selected, it is applied to the molecule to design the probes, which are synthesized and used for RNA depletion.

The invention recognizes that conventional probe coverage strategies create laboratory reagents that impose limitations on their use in analytical assays. For example, due to the variety in GC content, or due to RNA secondary structure, existing depletion probe sets can be used only under very limited conditions. Also, optimizing the reaction temperature for a majority of the probes means that the reaction is performed at a temperature too high or too low for some of the probes. Conventional probe coverage strategy is not well suited to potential regions of sequence similarity between the RNA to be depleted and the RNA of interest.

The invention addresses such limitations. The invention provides methods of probe design that rely first on biophysical properties of the molecules being depleted and those being preserved. Methods of the invention use sequence data for RNA molecules and also use reference RNA data, and apply modeling to predict probes that hybridize to target RNA without marking for destruction the mRNA to be preserved.

Probes are designed on the basis of biophysical properties. Methods and compositions of the invention use sequence information, GC richness, reference sequences, and modeled/predicted biochemistry behaviors to generate and select probe sets that perform well, with uniform, consistent hybridization throughout the probe set and also without off-target hybridization. A significant insight of the invention is that probe design does not require a spatial pattern that is projected indiscriminately onto the length of an RNA molecule. Instead, desired properties, such as a range of melting temperatures, are established and then sets of probes are designed having those desired properties. For example, a meta-set comprising all possible probes with the desired properties may be generated algorithmically, e.g., from sequence data, and then that meta-set may be reduced based on mapping, filtering, and/or scoring to select a smaller but useful set of probes. The selected set of useful probes is synthesized as DNA oligos and used to deplete the selected RNA from a sample, thus enriching for the RNA of interest. Significantly, a reference data set of sequences of the non-target RNA may be used in, e.g., a filtering or scoring step, to remove probes from the meta-set that would deplete possibly scarce mRNAs of interest from the sample.

In initial steps of generating all or a substantial number of possible probes, some embodiments start by establishing a nearly uniform Tm (or Tm range) between any probe and its target, and also aim to minimize the Tm between each probe and any potential off-targets.

A number of approaches may be used to arrive at the final set. The selected probes preferably bind across the target RNA. Algorithms may be used to seek a probe set that binds across all regions of the target (e.g., 45S, mitochondrial rRNA, or globin transcripts). Certain embodiments use graph-based algorithms that seek to minimize (with no specific constraints) the spacing between probes, and the potential Tm with off-targets (e.g., possibly estimated by percent homology with the strongest non-target BLAST hit). Other target spanning algorithms may be used to arrive at a set of probes. A probe set may be supplemented in part by one or more additional rounds of similar design, and/or by hand based on empirical data. For example, some probes may be “boosted” in concentration to improve performance, e.g., based on results in initial test assays.

Using such methodologies, the invention provides methods of making depletion probes, sets of the depletion probes, and methods of using the depletion probes for RNase H-based depletion of rRNA and other high expression RNAs from, for example, RNA-Seq libraries. Any suitable analytical package or algorithm may be used in probe design. Preferred embodiments generate the candidate probes in a manner that is informed by sequence information of the target as well as a set of reference RNA sequence information for the sample or organism (as an interesting aside, it does not matter if the algorithm uses a reference RNA sequence data set that also includes the target). The disclosure provides probe design methods, and probe sets that are designed by methods that are systematic and rational and that allow rapid design iteration to optimize performance of the probe set.

At least three fundamental approaches to designing probes are contemplated. A first approach makes use of Bloom filters for identification of unique k-mers (Bloom-filtered). A second approach generates a set of probes with balanced melting temperature (Tm) that is filtered by a basic local alignment search tool (e.g., BLAST or related algorithms) to avoid creating probes that deplete non-target RNAs (BLAST-filtered). A third approach generates a set of probes with balanced melting temperature (Tm), then uses BLAST to score probes, then selects a probe set that traverses the target RNA while minimizing score (BLAST-scored).
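By way of non-limiting illustration, the Bloom-filtered approach may be sketched as follows. The sketch is an assumption about one possible implementation and is not taken from the disclosure: the class `BloomFilter`, the helper names, the filter size, and the use of salted BLAKE2b hashes are all hypothetical choices. Sequences are assumed to be compared in the same orientation (e.g., the probe's target-site sequence against reference transcripts). Because a Bloom filter can return false positives but never false negatives, a candidate reported as unique shares no k-mer with the reference, while a small fraction of truly unique candidates may be rejected.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, n_bits: int = 1 << 20, n_hashes: int = 4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive n_hashes bit positions by salting a single hash function.
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq: str, k: int):
    """Yield every k-mer of seq, 5' to 3'."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_reference_filter(reference_seqs, k: int) -> BloomFilter:
    """Load every k-mer of the (hypothetical) off-target reference set."""
    bf = BloomFilter()
    for seq in reference_seqs:
        for kmer in kmers(seq, k):
            bf.add(kmer)
    return bf


def is_unique(candidate: str, reference_filter: BloomFilter, k: int) -> bool:
    """True if no k-mer of the candidate was seen in the reference."""
    return not any(kmer in reference_filter for kmer in kmers(candidate, k))
```

In such a sketch, the value of k and the filter size would be tuned to the size of the reference set; neither value is specified in the disclosure.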

In these approaches, potential probes may be screened for uniqueness against RefSeq human RNA (manually masked for targeted sequences). In certain exemplary embodiments, RNA28SNx and RNA18SNx were used as development models. Probes against other RNAs, such as 5S and mitochondrial rRNA probes, may be designed similarly. After the initial set of probes is generated, a final candidate set that covers the target may be selected, e.g., using bioinformatics programming techniques known to the skilled artisan. Preferred embodiments employ the BLAST-scored methodologies.

FIG. 1 diagrams a BLAST-scored method 101 of probe design according to certain embodiments. The method begins by selecting a target and obtaining sequence data. This can be performed by accessing a publicly available database of genetic information such as GenBank or Ensembl. For example, the method 101 may be used to deplete the Homo sapiens 18S ribosomal N3 RNA from a biological sample. A computer system of the invention may be used to retrieve NCBI reference sequence NR_146152.1. A software package such as BioPerl or BioRuby may be used to import sequence data, such as the RNA18SN3 FASTA file for 18S ribosomal RNA N3 stored at GenBank under NCBI reference sequence NR_146152.1. This file may be brought into the system and set as the target. The system may be operated to bring in multiple targets (e.g., essentially simultaneously, e.g., in one “run”). Preferred embodiments set at least two or three RNA molecules each as a target, such as the longest ribosomal RNAs plus a globin transcript.

Having set the target, the method 101 involves setting boundaries for melting temperature. This may be done by setting a target temperature and then allowing a range, such as 50 degrees plus or minus 3. The system then examines the target file to generate all possible strings that, if synthesized as DNA oligos, would have a Tm within bounds. In one straightforward embodiment, a software package on the computer system begins at the 5′ end of the target file and reads each possible k-mer, calculating Tm for each. Any suitable calculation may be performed. In some embodiments, the system uses [4(G+C)+2(A+T)]. To illustrate with a trivial example, NR_146152.1 begins with a T. So the first possible k-mer is a 1-mer consisting of T. The calculation returns 2, which is out of bounds (not within 3 degrees of 50), so the system does not write the k-mer to a file. The system then reads the 5′ 2-mer (which is TA), and so on. Every string that yields a Tm within bounds is written to a file, and then the system moves to position 2 in the target file.
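The enumeration described above may be sketched, for illustration only, as follows. The function names `wallace_tm` and `candidate_kmers` and the default length limits are hypothetical; the Tm formula is the 4(G+C)+2(A+T) calculation stated above. The sketch yields target-sense strings, and it is assumed that the synthesized DNA oligo would be the reverse complement of each string.

```python
def wallace_tm(seq: str) -> int:
    """Estimate Tm as 4*(G+C) + 2*(A+T); U is counted as T."""
    s = seq.upper().replace("U", "T")
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    return 4 * gc + 2 * at


def candidate_kmers(target, tm_target=50, tolerance=3, min_len=1, max_len=40):
    """Yield (start, kmer, tm) for every k-mer whose Tm is within bounds.

    start is a 0-based offset from the 5' end of the target sequence.
    min_len and max_len are illustrative defaults, not values from the text.
    """
    n = len(target)
    for start in range(n):
        for length in range(min_len, max_len + 1):
            if start + length > n:
                break
            kmer = target[start:start + length]
            tm = wallace_tm(kmer)
            if tm > tm_target + tolerance:
                break  # this estimate never decreases as the k-mer grows
            if tm >= tm_target - tolerance:
                yield start, kmer, tm
```

For the trivial example above, `wallace_tm("T")` evaluates to 2 and `wallace_tm("TA")` to 4, both far below the 47 to 53 degree window, so neither string would be written out.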

For this stage it may be preferred to have established a minimum (and/or maximum) length. The system iterates over the target file and writes all possible k-mers with Tm in bounds to an output file, e.g., all in-bounds probes may be written to a FASTA file.

The method may proceed to BLAST each probe against reference RNA sequence data. In one straightforward example, BLAST scores are used to simply omit probes that match non-target RNA above some threshold. In certain embodiments, the BLAST result is used to score the probes (e.g., [0 ... 1]), where the score will be used as a cost in a cost function. The computer system performing the operation may preferably use blastn-short.

The BLAST score may be used to filter probes out or to assign a score to probes that is used by an algorithm that selects a set of probes to cover the target RNA.

The algorithmic task for the computer system is to cover the target RNA (for convenience, from 5′ end to 3′ end), a task that can be presented to the computer as traversing a graph where the probes are nodes and any inter-probe gaps are edges (or, trivially, vice-versa). The system attempts to traverse the graph, using any score as a cost function for using that probe. This may be performed using a version of the Dijkstra algorithm in which edge weights become the “distance” between two nodes plus the “cost” for visiting the target node. In some embodiments, an edge weight includes a gap opening value and a separate gap extension value. In some other embodiments the cost of visiting a node may be based on its similarity to potential non-target RNAs.
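One possible form of such a traversal is sketched below for illustration. It is an assumption, not the disclosed implementation: the function name `select_probe_path`, the half-open coordinate convention, the virtual source and sink nodes at the 5′ and 3′ ends, and the default linear gap cost are all hypothetical. Each edge weight is the cost of the gap between two probes plus the cost of visiting the downstream probe, and only non-overlapping successors are considered.

```python
import heapq


def select_probe_path(probes, target_len, gap_cost=lambda gap: float(gap)):
    """Dijkstra search over candidate probes.

    probes: list of (start, end, visit_cost), 0-based, end exclusive.
    Returns (indices of the selected probes in 5'->3' order, total cost).
    """
    SRC, SINK = -1, len(probes)  # virtual nodes at the 5' and 3' ends
    dist = {SRC: 0.0}
    prev = {}
    heap = [(0.0, SRC)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        if u == SINK:
            break
        u_end = 0 if u == SRC else probes[u][1]
        # Edge to the sink pays for any uncovered 3' tail.
        edges = [(SINK, gap_cost(target_len - u_end))]
        for j, (start, end, visit_cost) in enumerate(probes):
            if start >= u_end and end > start:  # non-overlapping successor
                edges.append((j, gap_cost(start - u_end) + visit_cost))
        for v, w in edges:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))
    path, node = [], prev[SINK]
    while node != SRC:
        path.append(node)
        node = prev[node]
    return path[::-1], dist[SINK]
```

For instance, given four hypothetical candidates `[(0, 20, 0.0), (20, 40, 50.0), (25, 45, 0.0), (45, 60, 0.0)]` on a 60-base target, the search skips the abutting but costly second candidate and accepts a 5-base gap instead, selecting candidates 0, 2, and 3.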

FIG. 2 diagrams steps of a method of probe selection. In the top panel, all probes have been generated, e.g., a GenBank file is read for all strings of certain length calculated to have a Tm within bounds, which strings are written to a FASTA file. In the diagram, each rectangle is a candidate probe, a sequence from the GenBank file. The hatch-marked blocks are the probes that will be eliminated by BLAST comparison to reference RNA sequence data. The reference-matching probes are eliminated from consideration (directly removed in BLAST filter, effectively avoided in BLAST score), and an algorithm seeks to construct a directed graph through the probes. Any suitable algorithm may be used. Certain embodiments use a graph traversal algorithm that essentially looks for a directed acyclic graph with an optimized cost, where nodes and edges are assigned some cost. In the BLAST scoring embodiments, reference-matched probes (hatch-marked in FIG. 2) end up having a relatively high cost. Because edges may have costs proportional to length, the algorithm may seek to minimize gaps (as drawn, the rectangle “probes” are nodes and the arrows are edges). The nodes along the found path are the probe set that is selected. In this algorithm, the selected probe set offers non-overlapping coverage of the target RNA molecule.

Graph algorithms are known in the art and conducive to sequence analysis due to a natural fit between the linear nature of genetic information carried in nucleic acid sequences and the linear nature of a path through a directed graph. Graph platforms (such as Neo4j) are appealing here because they cross over a software/hardware barrier. Software packages can read from, write to, and query (e.g., traverse) data stored as graphs, but the graph storage makes non-standard use of computer hardware. For example, some graph platforms use index-free adjacency in which a graph query reads nodes, but the nodes contain references to the physical locations in the physical memory at which adjacent nodes are written so that traversal moves to the referenced location in hardware. This is in comparison to relational databases that use lookup index tables written in familiar software to document relationships, with index-free adjacency using spatial addresses on hardware in such a way as to be much faster than index tables. Because algorithms of the disclosure are conducive to implementation in graph platforms, methods of the disclosure potentially run very fast and can scale up to query any arbitrary sized reference target and BLAST the FASTA probes against any arbitrary sized reference set without any difficulty.

In exemplary embodiments, a cost function to traverse the graph was set to k*score^r (the k and r parameters can be specified from the command line but default to k=256, r=2). Effectively, k is equal to the distance penalty that would be paid by leaving a gap of sqrt(k) bases when score=1 (worst possible scenario). In the cost function, r defines the rate at which the score penalty decreases, such that larger values of r mean that small differences/small amounts of mismatch are treated more liberally. The graph traversal minimizes the cost function and the visited nodes become the probe set. The probes associated with a graph traversal with an optimal score (not necessarily absolute minimum) are used.
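The cost function may be illustrated with a short sketch. The function names are hypothetical. The quadratic gap penalty is an assumption inferred from the statement that k equals the penalty for a gap of sqrt(k) bases; the disclosure does not state the form of the distance penalty explicitly.

```python
def node_cost(score: float, k: float = 256.0, r: float = 2.0) -> float:
    """Cost of using a probe whose off-target match score lies in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return k * score ** r


def gap_penalty(gap_bases: int) -> float:
    """Assumed quadratic distance penalty: a gap of sqrt(k) bases costs k."""
    return float(gap_bases) ** 2


# With the defaults, a worst-case probe (score = 1) costs as much as a 16-base gap.
worst_case_probe = node_cost(1.0)   # 256.0
equivalent_gap = gap_penalty(16)    # 256.0

# A larger r treats partial matches more leniently.
lenient = node_cost(0.5, r=4.0)     # 16.0, versus 64.0 at r = 2
```

Under these assumptions, raising k makes off-target matches more expensive relative to gaps, so the traversal tolerates longer uncovered stretches in exchange for lower-scoring probes.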

FIG. 3 shows a distribution of inter-probe gap sizes for different values of k in the cost function. Increasing values of k appear to decrease overall BLAST scores of probes, with a potential tradeoff in overall coverage. Probe sets may be reviewed manually. For example, a biologist may select k=1000 because any matches to off-target RNA get very expensive scores, so the resulting probe set will not deplete rare mRNAs, even though it may leave a 28 base gap (final number of the k=1000 line in the figure) of ribosomal RNA un-depleted after RNase H digestion.

Thus the software package initially parses the target sequence to generate a set of probes (candidate sequences complementary to the RNA molecule within predicted melting temperature boundaries), BLASTs the probes to assign a cost function to each of the candidate sequences based at least in part on a match score to the reference off-target RNAs, and finds a cost-minimizing directed graph through the probes (candidate sequences) to thereby select a set of probes that map to positions along the RNA molecule. As discussed, the set is selected by an algorithm that minimizes cumulative cost function, inter-position gaps, and overlap. The described algorithm does not literally minimize overlap, but simply avoids it. Other algorithms are within the scope of the disclosure. In fact, overlap does not necessarily need to be avoided as it is not to be assumed that so-called overlap probes actually compete for binding to the same one copy of a molecule. Using the method 101, the probes are selected by a computer system to provide DNA oligos that each uniquely anneal to the RNA molecule, preferably with minimal or no overlap, and with minimal or no annealing to off-target RNA.

A second method of probe design, BLAST-filter, is also within the scope of the invention. In BLAST-filter embodiments, a computer system may be used to first generate all probes within a given Tm window using, e.g., an input target RNA sequence file and a Perl script. In any embodiment, the probe length may be set to have a default, e.g., 21, but that can also be varied, e.g., a different value can be used, the value could be switched on-the-fly, or a range of values may be set. Similarly, temperature parameters may be varied. For example, some embodiments use a script that permits a 3.0 degree variance around an initial value (i.e., if the min and max length do not permit generating a probe with the Tm initial value, then the script will generate the next closest candidate that is within 3.0 degrees; this can be adjusted or eliminated). To generate all probes, they are designed by finding the first probe of desired Tm starting from every base from the 5′ end. An additional strategy, useful in any embodiment to ensure nothing is lost at the 3′ end, has the script also produce probes from the 3′ end 250 bases in. In various preferred embodiments, the script uniquifies the initially generated probe set.
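The generation step may be sketched as follows. The disclosure describes a Perl script; this illustration uses Python instead, and the function names, the length limits, and the use of the 4(G+C)+2(A+T) estimate here are assumptions. For each start position, the sketch keeps the candidate whose estimated Tm is closest to the initial value, provided it falls within the permitted variance, and then removes duplicate sequences. It returns target-site sequences; the oligo to synthesize is assumed to be the reverse complement. The additional pass from the 3′ end is not shown.

```python
def wallace_tm(seq: str) -> int:
    """Estimate Tm as 4*(G+C) + 2*(A+T); U is counted with A and T."""
    gc = sum(seq.count(b) for b in "GC")
    at = sum(seq.count(b) for b in "ATU")
    return 4 * gc + 2 * at


def probes_from_every_base(target, tm_goal=50.0, tolerance=3.0,
                           min_len=15, max_len=30):
    """Return de-duplicated target sites, one per start position at most."""
    target = target.upper()
    seen = set()
    probes = []
    for start in range(len(target) - min_len + 1):
        best = None  # (distance from tm_goal, site sequence)
        for length in range(min_len, max_len + 1):
            if start + length > len(target):
                break
            site = target[start:start + length]
            delta = abs(wallace_tm(site) - tm_goal)
            # Strict "<" keeps the shortest site when two are equally close.
            if delta <= tolerance and (best is None or delta < best[0]):
                best = (delta, site)
        if best is not None and best[1] not in seen:
            seen.add(best[1])  # "uniquify": skip repeated sequences
            probes.append(best[1])
    return probes
```

On a repetitive toy input such as `"ACGT" * 10`, many start positions yield the same site sequence, so the de-duplicated list is much shorter than the number of start positions.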

In this BLAST-filter embodiment, the system will BLAST probes with blastn-short against RefSeq minus the target sequences, and filter out probes with a percent identity above a given threshold. Results show that this strategy gives results with some similarity to the Bloom filter methodologies. These BLAST-filtering methodologies may be used to generate probes with the strictest possible identity criteria.
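The filtering step may be sketched as follows, for illustration. The sketch assumes that BLAST results are available as tab-delimited rows in the standard tabular layout (query ID, subject ID, percent identity, and further columns), such as may be produced by running blastn with `-task blastn-short -outfmt 6`. The function name, the 90 percent default, and the sequence identifiers in the usage example are hypothetical.

```python
def filter_by_identity(probe_ids, blast_rows, max_identity=90.0, target_ids=()):
    """Keep probes with no off-target hit above max_identity percent.

    blast_rows: iterable of tab-delimited lines whose first three fields are
    query ID, subject ID, and percent identity. Hits whose subject is one of
    target_ids are ignored, which mimics masking the targets in the reference.
    """
    masked = set(target_ids)
    rejected = set()
    for row in blast_rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.rstrip("\n").split("\t")
        query_id, subject_id, pident = fields[0], fields[1], float(fields[2])
        if subject_id in masked:
            continue
        if pident > max_identity:
            rejected.add(query_id)
    return [p for p in probe_ids if p not in rejected]


rows = [
    "probe1\tmRNA_A\t100.0\t21\t0\t0\t1\t21\t5\t25\t1e-5\t42.1",
    "probe2\tmRNA_B\t80.0\t20\t4\t0\t1\t20\t9\t28\t0.5\t20.3",
    "probe3\trRNA_target\t100.0\t21\t0\t0\t1\t21\t1\t21\t1e-5\t42.1",
]
kept = filter_by_identity(["probe1", "probe2", "probe3"], rows,
                          target_ids=["rRNA_target"])
```

In this toy example, probe1 is removed because of a perfect match to an off-target transcript, probe2 is retained because its best hit is below the threshold, and probe3 is retained because its only hit is to the masked target. Percent identity alone does not account for alignment length, so a practical implementation might also require a minimum aligned length; that refinement is an assumption and is not shown.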

Any suitable approach may be used to generate a probe set. An insight of the invention is that information about biochemical properties of target and off-target RNA is used first, to generate a probe set that will actually perform the intended task of depleting target RNA and leaving off-target RNA. Having generated probes according to biophysical properties, the practitioner may then apply other criteria (e.g., a cost function to the Dijkstra algorithm that minimizes gaps) that promote results similar to the chief purpose of prior art probe design strategies (no gaps). Any suitable approach that uses biophysical properties may be used, such as the Bloom-filter, BLAST-filter, or BLAST-score methods presented here. Methods of the invention provide probe sets that tend to have certain characteristic features.

Certain features of probe sets designed according to methods of the invention are characteristic and worthy of note. For example, probe spacing is highly variable and appears random (sequence of gaps below). Some probes abut one or more other probes (i.e., 5′ end touches 3′ end of another probe and/or 3′ end touches 5′ end of another probe). Some probes overlap another probe (note comments elsewhere about overlap). Many gaps exceed 30 nucleotides (max=111, see FIG. 4). Probes have variable length (min=18, max=44, see FIG. 5). The Tm values of the probes are largely uniform, mostly falling into two discrete groups (in general we found that probes targeting GC-rich segments needed to have a higher Tm than those targeting more balanced segments). Coverage of all targets is not end-to-end. In fact, coverage of the 45S rRNA is missing 5 nucleotides at the 5′ end and 47 nucleotides at the 3′ end. Probes are not present in the mix at equimolar concentrations; some are boosted up to 10× to improve performance (based on empirical data).

Some of those features are contrary to accepted wisdom about depletion probes. For example, in prior art probe designs, gaps of 30 and even 111 bases are rare and thought to be something to be avoided. Additionally, un-ordered mixtures of gaps, abutments, and overlaps were treated as something to avoid. Significantly, uniformity of length is sacrificed and probes of the invention have variable length and substantially uniform melting temperature. Because probes of the disclosure are designed first from biophysical properties, they outperform prior art probes (e.g., substantially uniform Tm ensures consistent digestion by RNase H because intended heteroduplexes will all be stable under similar conditions, whereas in prior-art strict tiling strategies, the probes and targets in AT-rich regions may melt apart). A consequence of the design paradigm is that probe spacing is highly variable and appears random.

Strategies of the disclosure have distinct advantages. For example, there is a reduced need to denature targets and/or probes, allowing a reduction in workflow. The invention eliminates the need to add enzyme while the reaction mix is being heated, likely due to more efficient denaturation of off-target probe binding. Together, those reductions cut workflow time by half compared to competitors, with no apparent loss in data quality.

FIG. 4 is a histogram of gap sizes that were determined for one specific RNA. As discussed, probe spacing is highly variable and appears random. A set of probes was designed against the specific RNA. The following is the sequence of inter-probe gap sizes where the probes form heteroduplexes with the specific RNA (negative values indicate an overlap): 7, 10, 10, 10, 1, 9, 29, 18, 14, 15, 3, 21, 0, 0, 0, 7, 12, 17, 2, 1, 2, 3, 2, 33, 6, 18, 16, 34, 6, 9, 11, 2, 19, 3, 20, 11, 1, 6, 18, 12, 10, 13, 11, 13, 17, 6, 6, 8, 16, 14, 19, 10, 0, 11, 8, 2, 14, 5, 7, 8, 3, 9, 11, 21, 0, 3, 17, 12, 7, 13, 14, 15, 18, 1, 1, 0, 2, 3, 11, 4, 13, 18, 19, 6, 5, 5, 14, 1, 0, 0, 18, 19, 13, 15, 6, 12, 3, 111, 1, 1, 46, 21, 12, 12, 8, 11, 12, 9, 0, 1, 7, 3, 2, 0, 14, 9, 0, 7, 7, 20, 0, 7, 1, 3, 1, 1, 0, 0, 1, 0, 10, 3, 3, 26, 4, 11, 9, 10, 1, 2, 1, 5, 0, 0, 5, 19, 7, 17, 10, 5, 6, 0, 3, 8, 5, 0, 6, 6, 6, 7, 13, 5, 29, 0, 21, 29, 3, 7, 5, 12, 7, 8, 12, 0, 9, 4, 7, 9, 5, 34, 3, 13, 10, 9, 9, 4, 22, 36, 15, 26, 29, 8, 14, 17, 4, 44, 12, 11, 47, 9, 10, 9, 28, 25, 15, 0, 19, 22, 15, 37, 14, 12, 33, 5, 29, 39, 45, 7, 41, 25, 13, 32, 33, 10, 16, 39, 0, 0, 3, 18, 24, 7, 20, 7, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 0, 0, 0, 0, 0, 0, 0, −28, 19, 10, 0, 0, 0, 16, 0, 4, 14, 10, 10, 22, −9, 0, 18, 4, 0, 0, 3, 8, 2, 4, 14, 8, 2, 2, 6, 3, 5, 8, 8, 25, 25, 14, 4, 5, −8, 2, 13, 12, 16, 2, 2, 5, 9, 6, 2, 9, 6, 10, 11, 3, 8, 4, 4, 4, 5, 3, 3, 0, 0, 0, 3, 0, 10, 0, 4, 0, 0, 0, 0, 5, 10, 0, 0, 0, 0, 9, 6, 13, 0, 1, −9, 8, 12, 13, 26, 11, 9, 45, −16, −5, −25, 4, 0, 18, 11, 0, 34, 14, 1, 0, 0, 0, 0, 1, 4, 8, 1, 0, 3, 5, 22, 14, 37, 11, 16, 9, 7, 0, 45, 10, 6, 16, 45, 22, 0, 28. There is no evident uniformity or pattern. The probe set was not designed to form a pattern along the molecule.
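For illustration only, a gap sequence of this kind may be derived from probe binding coordinates as sketched below. The sketch assumes that each probe is represented as a (start, stop) pair in half-open coordinates on the target RNA, so that a gap of zero indicates abutting probes and a negative gap indicates an overlap; the function names and the coordinate convention are assumptions of this example.

```python
def inter_probe_gaps(probes):
    """Return the gaps between consecutive probes, ordered by start position.

    probes: iterable of (start, stop) pairs, half-open, in target coordinates.
    A gap is start(next) - stop(current): 0 means the probes abut,
    a negative value means they overlap.
    """
    ordered = sorted(probes)
    return [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]


def summarize_gaps(gaps):
    """Report a few of the features discussed in the text for a gap list."""
    return {
        "max_gap": max(gaps),
        "abutting": sum(1 for g in gaps if g == 0),
        "overlapping": sum(1 for g in gaps if g < 0),
    }


# Hypothetical coordinates, not taken from the disclosed probe set.
example = [(0, 21), (28, 50), (50, 70), (62, 90)]
gaps = inter_probe_gaps(example)
print(gaps, summarize_gaps(gaps))
```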

Similarly to the gap sizes, probe length is variable.

FIG. 5 gives a histogram of probe lengths for the set designed against the specific RNA. The histogram of gaps and the histogram of probe lengths reveal important properties of probe sets of the invention. The gaps are not uniform. In fact, the distribution of gaps is not regular or standard.

Note from the histogram that there is at least one gap above 100. The mode is likely zero. Also, there are negative values. Similarly, the probe lengths include at least one at 44 and an apparent mode just above 20. The histograms exhibit such patterns because the probes were not designed to a gap or length specification. In fact, biophysical properties may inform not just sequence selection but also probe length. For example, high GC oligos may be allowed to be made at the higher end within the Tm boundaries, but also may be made shorter than low GC oligos. Algorithmically, the system may seek to minimize spacing between adjacent heteroduplexes, yet nevertheless give such a histogram (with a >100 value).
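The relationship between GC content and probe length may be illustrated with the following minimal sketch, which extends a probe from a fixed start position until a predicted Tm falls within a window. The Tm estimate used here is a common GC-content approximation for DNA duplexes (64.9 + 41 × (G + C − 16.4) / N) and is only a stand-in for whatever DNA/RNA heteroduplex model is used in practice. The function names, the 60 to 70 degree window in the usage lines, and the use of the 18 to 44 base length bounds as defaults are assumptions of this example.

```python
def approx_tm(seq):
    """Rough Tm estimate (degrees C) from GC content; stand-in model only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def pick_probe_length(target, start, tm_min, tm_max, min_len=18, max_len=44):
    """Return the shortest probe length whose estimated Tm is in the window.

    target: target sequence (the probe would be its reverse complement, which
            has the same G+C count, so the estimate is computed on the target).
    Returns None if no length between min_len and max_len fits.
    """
    for n in range(min_len, max_len + 1):
        if start + n > len(target):
            break  # ran off the end of the target
        if tm_min <= approx_tm(target[start:start + n]) <= tm_max:
            return n
    return None


# A GC-rich region reaches the window with a shorter probe than a
# balanced region does (hypothetical sequences).
print(pick_probe_length("GC" * 30, 0, 60.0, 70.0))    # shorter probe
print(pick_probe_length("ACGT" * 15, 0, 60.0, 70.0))  # longer probe
```

Under this simplified model, probe length falls out of the Tm constraint rather than being specified in advance, which is consistent with the variable lengths seen in FIG. 5.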

Using probes of the disclosure, one may perform RNA depletion by hybridizing a plurality of DNA oligos to a target RNA molecule(s) in a sample to form heteroduplexes, wherein the DNA oligos have sequences selected to (i) give the heteroduplexes melting temperatures within a predetermined range, and (ii) minimize matches to reference off-target RNAs; and digesting RNA in the heteroduplexes. These steps may be performed within a single-cell RNA-Seq (scRNA-Seq) workflow. For example, the method may include isolating single cells into a fluid partition, releasing nucleic acid from the cell within the partition (e.g., by chemical or thermal cell lysis), and depleting the target RNA molecule(s). The fluid partition may be a droplet, e.g., in an emulsion or microfluidic device, or a well, e.g., in a plate such as a multiwell plate or picotiter plate. The method may include introducing reaction reagents (e.g., any of lysis reagents, depletion probes of the invention, RNase H, capture oligos, primers, template-switching oligos, tagmentation reagents, reverse transcriptase, dNTPs, etc.) at droplet formation, by pipetting into wells, or by droplet merger. Any of the primers or oligos may include barcodes, sequencing adaptors, universal priming sites, etc. For example, in some embodiments, every RNA transcript is captured by an oligo that includes a unique or nearly-unique sequence to function as a unique molecular identifier and every cDNA is given a partition- or cell-barcode.

Using depletion probes of the invention, heteroduplexes are formed and may exhibit a non-uniform distribution along the RNA molecule. The RNase H digests the target RNA. Preferably the DNA oligos (i.e., the depletion probes) have been tested and shown not to hybridize stably to protein-coding transcripts in a standardized human reference RNA extract. After digestion, superabundant RNAs such as ribosomal or globin RNAs are substantially removed. Remaining transcripts are copied into cDNA (optionally barcoded by molecule and/or cell). Downstream method steps may further copy or amplify the cDNA molecules. For example, the cDNA molecules may be amplified to create a sequencing library, e.g., with sequencing adaptors such as Illumina Y-adaptors at ends of the library members.

Methods described herein are useful to provide a set of RNA depletion probes that includes a plurality of DNA oligos able to hybridize to a target RNA molecule to form heteroduplexes. The DNA oligos have sequences selected to (i) give the heteroduplexes melting temperatures within a predetermined range, and (ii) minimize matches to off-target RNA. Preferably the heteroduplexes will form with a non-uniform distribution along the RNA molecule, e.g., spacing between the heteroduplexes may be highly variable and/or mathematically random. The probes may be designed algorithmically, even automatically, by a computer system operable for generating a set of candidate sequences complementary to the RNA molecule with predicted melting temperatures within the predetermined range; assigning a cost function to each of the candidate sequences based at least in part on a match score to the reference off-target RNAs; and selecting a set of the candidate sequences that map to positions along the RNA molecule, wherein the set is selected by an algorithm that minimizes: cumulative cost function, inter-position gaps, and overlap. The computer system may output the selected set as any suitable file such as a FASTA file, a file formatted for input into an oligonucleotide synthesis instrument, or a file in a format accepted by a commercial oligonucleotide vendor such as Integrated DNA Technologies (IDT). In some embodiments, the computer system may even include an application programming interface that selects the probe set and transmits a well-formed IDT order over internet protocols. The probe set, once synthesized, may be provided in packaging for use in a laboratory or for shipping or sale. For example, the probe set may be lyophilized in a tube or in solution in a tube, where a tube can be any suitable tube such as a 1.5 mL microcentrifuge tube such as those sold under the trademark EPPENDORF. The probe set may then be used in methods of performing RNA depletion as described above.

As described above, the probe set may be selected by methodologies that use Bloom filters, BLAST filters, or BLAST scoring. Preferred BLAST scoring embodiments are detailed above. A Bloom filter embodiment is described in the examples. Besides the Bloom filter, BLAST filter, or BLAST score methodologies, any other suitable methodology may be used to provide depletion probes. The invention provides sets of RNA depletion probes, short DNA oligos that hybridize along the length of a target RNA and mediate digestion of the target RNA by RNase H to remove super-abundant RNA molecules from a sample. Depletion probes according to the invention are designed foremost based on biochemistry and the biophysical properties of the probes so that all of the depletion probes of a set exhibit substantially uniform, consistent behavior in binding to a target RNA in a sample. Probes are principally designed to specific performance targets and biophysical properties, yielding probe sets with irregular, even essentially random, spacing along a target RNA molecule.

EXAMPLES

Example 1. Bloom Filters

FIG. 6 diagrams a workflow for a Bloom filter design methodology. The workflow begins with setting a target RNA and having a RefSeq data set. For example, a package such as BioPerl may be used to open a handle to the most current version of RefSeq as it was described in Pruitt, 2007, NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, Nucleic Acids Res 35(Database):D61-5, incorporated by reference. The target RNA is stripped out of (or masked off in) the RefSeq.

Additionally, the target is kmerized, or parsed into kmers such as all possible kmers. For the skilled artisan it is trivial to write a script such as a Perl script that writes all possible kmers to a FASTA file. Depending on platform, it may be preferable to convert the FASTA file to a binary alignment map (BAM) for downstream processing, e.g., by the PathSeq pipeline.
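A kmerizing script of the kind mentioned above may look like the following minimal sketch, written here in Python rather than Perl. The function names and the FASTA header format (name, start, and stop joined by underscores) are assumptions chosen for this example.

```python
import io


def kmerize(seq, k):
    """Return every length-k substring of seq, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def write_kmers_fasta(seq, k, handle, name="target"):
    """Write each kmer as a FASTA record named <name>_<start>_<stop>."""
    for i, kmer in enumerate(kmerize(seq, k)):
        handle.write(f">{name}_{i}_{i + k}\n{kmer}\n")


# Usage with an in-memory handle; a real run would pass an open file instead.
buf = io.StringIO()
write_kmers_fasta("ACGUAC", 4, buf)
print(buf.getvalue())
```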

Using a Bloom filter, the system searches the target for kmers from RefSeq. This may be performed using the PathSeq GATK pipeline to build a Bloom filter package that identifies kmers that are present in a target sequence but not present elsewhere in RefSeq.

A Bloom filter is a space-efficient probabilistic data structure useful to test whether an element is a member of a set. False positive matches are possible, but false negatives are not. An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions, generating a uniform random distribution. Typically, k is a small constant which depends on the desired false positive rate F, while m is proportional to k and the number of elements to be added. To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set; if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem. See Najam, 2019, Pattern matching for DNA sequencing data using multiple Bloom filters, Biomed Res Int 14:7074387, incorporated by reference.
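The add and query operations described above may be sketched as follows. This is a minimal illustration and not the PathSeq implementation: the class name, the derivation of the k hash functions from salted SHA-256 digests, and the use of one byte per bit position (for readability rather than space efficiency) are all assumptions of this example.

```python
import hashlib


class BloomFilter:
    """Simple Bloom filter with m positions and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # all positions start at 0

    def _positions(self, item):
        # Derive k array positions by salting one hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Any 0 means "definitely absent"; all 1s means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


def kmers_absent_from_background(target_kmers, background_kmers, m=1 << 16, k=3):
    """Keep target kmers that the filter reports as absent from the background.

    Because false negatives cannot occur, no retained kmer is in the
    background; false positives may discard some usable kmers.
    """
    bloom = BloomFilter(m, k)
    for kmer in background_kmers:
        bloom.add(kmer)
    return [kmer for kmer in target_kmers if kmer not in bloom]
```

In this arrangement a false positive only costs a candidate kmer, whereas a kmer shared with an off-target transcript is never retained, which is the conservative direction for probe design.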

The Bloom filter is applied to the kmers. For example, the Bloom filter may be used to simply exclude kmers that are found in RefSeq. Alternatively, the Bloom filter may be used to find kmers in RefSeq and to assign scores to those (e.g., some constant times % match or some other such arbitrary score).

With the set of kmers in a file or data structure, the system tiles a set of the kmers across the target RNA. This may be performed as a graph traversal, which is well suited to performance by computers even for large (including very, very large) data quantities due to the unique way graph platforms such as Neo4J access hardware, e.g., with index-free adjacency. The system tiles the RNA sequence. Tiling an (RNA) sequence can be stated as the problem of traversing a directed graph where edges have cost. In some embodiments, initially the cost of an edge is the distance (in bases) between the end of one probe and the beginning of the next. The lowest cost path from the beginning of the RNA to the end can be most efficiently (and optimally) calculated using Dijkstra's algorithm (see e.g., Rodriguez-Puente, 2013, Algorithm for shortest path search in geographic information systems by using reduced graphs, Springerplus 2:291, incorporated by reference). With a binary heap, Dijkstra runs in O((|V| + |E|) log |V|). Using the simple heuristic of only considering probes that live within a fixed distance of each other (as opposed to all to all), |V| ∈ O(|E|), so the runtime complexity is O(|E| log |E|).

Dijkstra's algorithm was implemented in a general fashion that allows it to operate on any graph object, for which the nodes can hold any data type. It is noted that in the first implementation of the tiling algorithm, the graph was constructed such that the distance from probe x to probe y is start(y)−stop(x). This works, but Dijkstra is a greedy algorithm, and may leave gaps at the end of the tiling path (i.e., when there are multiple optimal solutions, it will pick one where the costs are incurred late in the optimization, and it generally accumulates these in longer stretches). This is fixed relatively easily by setting the distance between nodes to be [start(y)−stop(x)]^2, so that the cost grows faster when a single gap is extended relative to opening a number of short gaps. Additionally or alternatively, it may be suitable to assign a gap extension penalty distinct from a gap formation penalty.
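The tiling traversal with the squared gap cost may be sketched as below. This is an illustrative reconstruction under stated assumptions and not the implementation referred to above: probes are (start, stop) pairs in half-open target coordinates, zero-length source and sink nodes mark the two ends of the RNA, only non-overlapping successors within a hypothetical max_gap are connected, and the neighbor scan is a simple loop rather than a position index.

```python
import heapq


def tile_probes(probes, target_len, max_gap=50, power=2):
    """Lowest-cost tiling path through candidate probes via Dijkstra.

    Edge cost from probe x to probe y is (start(y) - stop(x)) ** power,
    so power=2 penalizes one long gap more than several short gaps.
    Returns the chosen probes in order, or None if the end is unreachable.
    """
    source, sink = (0, 0), (target_len, target_len)
    nodes = sorted(set(probes)) + [sink]
    best = {source: 0}
    prev = {}
    heap = [(0, source)]
    settled = set()
    while heap:
        cost, x = heapq.heappop(heap)
        if x in settled:
            continue  # stale heap entry
        settled.add(x)
        if x == sink:
            break
        for y in nodes:
            gap = y[0] - x[1]
            if y in settled or gap < 0 or gap > max_gap:
                continue  # skip overlaps and probes beyond the fixed distance
            new_cost = cost + gap ** power
            if new_cost < best.get(y, float("inf")):
                best[y] = new_cost
                prev[y] = x
                heapq.heappush(heap, (new_cost, y))
    if sink not in prev:
        return None
    path, node = [], prev[sink]
    while node != source:
        path.append(node)
        node = prev[node]
    return path[::-1]


# Hypothetical candidates on a 60-base target. With linear cost, leaving one
# 20-base gap ties with leaving two 10-base gaps; with squared cost the two
# short gaps (cost 200) beat the single long gap (cost 400).
print(tile_probes([(0, 20), (20, 40), (30, 50)], 60))
```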

It is noted that the graph algorithm may read from initial probe candidates that cover every base in the target (e.g., RNA18SN3) yet leave long stretches uncovered. For RNA18SN3, the filtered 21-mers cover all but 4 bases, yet two Dijkstra tilings leave 223 and 222 bases uncovered, respectively. Because the probes are non-overlapping and there is only one probe starting at each position, filtering out probes makes it highly unlikely that a contiguous coverage path can be found. Interestingly, this need not be a restriction on applicability, and the resulting probe set may be used for RNA depletion. It is noted that with the existing algorithm, when using an unfiltered set of probes (i.e., when every possible probe is fed into the tiler), the return is an optimal path with just a small coverage gap at the end (if the length of the tiled sequence is not a multiple of the length of the probes, which are all fixed in this case). Inputs that contain a proper tiling return the same input. Manual curation of results may be helpful, but in practice it appears to confirm that suitable results are obtained.

Example 2. Design Workflow

FIG. 7 describes an example of a method of Human/Mouse/Rat (H/M/R) probe design.

The method includes: algorithmic design of depletion probes to human transcript targets; removing probes with high homology to undesired human transcript targets; algorithmic design of depletion probes to mouse/rat transcript targets; removing probes with high homology to undesired mouse/rat transcript targets; cross checking human probes against the mouse/rat transcriptome and vice versa; and removing probes with high homology to undesired transcript targets in H/M/R. After such design methodology, the final probe set may be synthesized and used to deplete unwanted RNA from a sample or packaged for storage, shipping, distribution, or later use. For example, the probes may be placed in a tube or vial, in an aqueous solution or lyophilized for later reconstitution. Such vial or tube may optionally include other reagents such as RNase H.

Example 3. Results

FIG. 8 shows ribosomal RNA depletion on total RNA derived from whole blood. The left bar, “before RNA depletion”, shows that the sample was >80% ribosomal RNA. After rRNA depletion, as shown by the bar on the right, the sample was <5% rRNA and approximately 75% mRNA. Thus these results show that probe sets and methods of the disclosure are effective for removing ribosomal RNA from samples.

FIG. 9 shows the result of hemoglobin RNA depletion from whole blood using a probe set and methods of the disclosure. Before hemoglobin depletion, as shown by the bar on the left, the sample included about 35.75% residual hemoglobin RNA. After depletion, the sample included about 0.01% residual hemoglobin RNA. Thus these results show that probe sets and methods of the disclosure are effective for removing hemoglobin RNA from samples.

FIG. 10 shows that methods and probe sets of the invention are useful for robust ribosomal RNA depletion from total RNA samples derived from formalin-fixed, paraffin-embedded (FFPE) tissue. For 4 blocks each, with 10 ng and 400 ng total RNA per block being obtained, in all cases the % ribosomal RNA after depletion was lower than or equal to about 7% and, in most cases, was significantly lower than 5%. Thus these results show that probe sets and methods of the disclosure are effective for removing RNA from FFPE samples.

FIG. 11 gives quantitative results of ribosomal depletion from total RNA derived from universal mouse and rat references. Using methods and probe sets of the invention, the mouse sample was depleted from about 85% ribosomal RNA down to less than about 5% ribosomal RNA. The rat sample had about 80% ribosomal RNA originally, and methods and probe sets of the invention depleted the rat sample down to less than about 5% ribosomal RNA. Thus these results show that probe sets and methods of the disclosure are effective for removing RNA from rat or mouse samples.

What is claimed is:
 1. A method for depleting RNA, the method comprising: exposing a plurality of DNA oligos to a sample, wherein the DNA oligos form heteroduplexes with a specific subset of RNA in the sample within a predetermined range of melting temperatures but do not substantially interact with RNA that are not part of the subset; and digesting RNA in the formed heteroduplexes.
 2. The method of claim 1, wherein a plurality of the DNA oligos form heteroduplexes along a length of RNA molecules in the subset.
 3. The method of claim 2, wherein the DNA oligos form heteroduplexes in a non-uniform distribution along RNA molecules in the subset.
 4. The method of claim 1, wherein the DNA oligos form heteroduplexes that tile along the length of members of the subset of RNA.
 5. The method of claim 4, wherein the oligos are designed to minimize spacing between adjacent heteroduplexes.
 6. The method of claim 1, wherein the predetermined range of melting temperatures is optimized to specific sequences of the DNA oligos to maximize binding to RNA in the subset.
 7. The method of claim 1, wherein the sequences of the DNA oligos are selected by: generating a set of candidate sequences complementary to RNA in the subset with a melting temperature within the predetermined range; assigning a cost function to each of the candidate sequences based at least in part on a match score to reference off-target RNAs; and selecting a set of the candidate sequences that minimizes the cost function, inter-position gaps, and overlap.
 8. The method of claim 1, wherein the subset of RNA is selected from the group consisting of ribosomal RNA, a globin transcript, and mitochondrial RNA.
 9. The method of claim 1, wherein the digesting step includes treating the sample with RNase H.
 10. The method of claim 1, wherein the method increases a ratio of poly-A tailed RNA to non-coding RNA in the sample.
 11. The method of claim 1, wherein spacing between the heteroduplexes is mathematically random.
 12. The method of claim 1, wherein sequences of at least two of the DNA oligos are selected to hybridize to the subset of RNA at locations that overlap or abut.
 13. The method of claim 1, wherein the DNA oligos have lengths that vary from about 18 bases to about 44 bases.
 14. The method of claim 1, wherein the DNA oligos are present in the mix at non-equimolar concentrations.
 15. The method of claim 9, wherein the RNA, the DNA oligos and the RNase H are added together in a single tube before the hybridization and digestion are carried out.